Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement an initial set of uniform DCGM GPU metrics in dcgmreceiver. #219

Merged
merged 39 commits into from
Sep 13, 2024

Conversation

igorpeshansky
Copy link
Member

This PR implements an initial set of DCGM GPU metrics from the uniform set defined in go/gce-gke-gpu-dcgm-metrics in dcgmreceiver.

In the process, it changes the GPU device-identifying metric attributes into resource attributes, adds support for older (and newer) GPU devices, and removes the assumption of 1:1 metric/DCGM field correspondence.

s.mb.RecordGpuDcgmPcieIoDataPoint(now, pcieRx, metadata.AttributeNetworkIoDirectionReceive)
}
if metric, ok := metrics["DCGM_FI_PROF_NVLINK_TX_BYTES"]; ok {
nvlinkTx := int64(float64(metric.asInt64()) * (s.config.CollectionInterval.Seconds())) /* rate to delta */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This only works if scrape() is called exactly once per collection interval.

It sounds like the API already gives you timestamps for each point, so you can walk the list of points and integrate each point with the corresponding timestamp. And you probably need to do that from a separate collection goroutine if it needs to happen more often than the collection interval.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now, scrape() is called exactly once. I agree that the accuracy of the current approach is somewhat suboptimal, but I did fix a few bugs, and wasn't able to get the separate collection goroutine working. We can revisit this on the igorpeshansky-dcgm-new-metrics-parallel branch, but for now I'll stay with one scrape per collection.


// RecordGpuDcgmNvlinkIoDataPoint adds a data point to gpu.dcgm.nvlink.io metric.
func (mb *MetricsBuilder) RecordGpuDcgmNvlinkIoDataPoint(ts pcommon.Timestamp, val int64, networkIoDirectionAttributeValue AttributeNetworkIoDirection) {
mb.metricGpuDcgmNvlinkIo.recordDataPoint(mb.startTime, ts, val, networkIoDirectionAttributeValue.String())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is using the wrong start time if these are deltas. What happens when these incorrect delta points get written to Cloud Monitoring?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be a bug in mdatagen. I've changed all of these metrics to cumulative, for which this start time is more reasonable.

@@ -39,44 +48,44 @@ func (m *dcgmMetric) asInt64() int64 {
return *(*int64)(unsafe.Pointer(&m.value[0]))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As the name implies, this is unsafe. Accessing a pointer in this way is undefined behavior in Go.

You want int64(binary.NativeEndian.Uint64(m.value[:8])), or for processing a whole slice at a time, binary.Read(...). Ditto for the three functions above.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found an even more robust solution. There is no more usage of unsafe in this receiver. Can't speak for go-dcgm, though…

receiver/dcgmreceiver/metadata.yaml Show resolved Hide resolved
receiver/dcgmreceiver/client.go Show resolved Hide resolved
receiver/dcgmreceiver/scraper.go Outdated Show resolved Hide resolved
@igorpeshansky igorpeshansky force-pushed the igorpeshansky-dcgm-new-metrics branch from b1525f7 to 3e8243d Compare July 25, 2024 23:32
@igorpeshansky igorpeshansky force-pushed the igorpeshansky-dcgm-new-metrics branch 3 times, most recently from 94f6048 to b00e858 Compare July 27, 2024 19:58
@igorpeshansky igorpeshansky force-pushed the igorpeshansky-dcgm-new-metrics branch 2 times, most recently from 81f4014 to 4706b7a Compare July 28, 2024 07:36
@igorpeshansky igorpeshansky force-pushed the igorpeshansky-dcgm-new-metrics branch 3 times, most recently from c91caf6 to b779893 Compare July 28, 2024 13:27
igorpeshansky and others added 2 commits August 14, 2024 18:09
Metrics are now scraped independently of collector polling cycles, allowing higher resolution and better integrations.

Co-authored-by: Igor Peshansky <[email protected]>
@igorpeshansky igorpeshansky force-pushed the igorpeshansky-dcgm-new-metrics branch 2 times, most recently from 46b03ce to c148e0c Compare September 11, 2024 21:31
@igorpeshansky igorpeshansky force-pushed the igorpeshansky-dcgm-new-metrics branch from c148e0c to 5de8b1c Compare September 11, 2024 21:42
@igorpeshansky igorpeshansky merged commit f1f132c into master Sep 13, 2024
7 checks passed
@igorpeshansky igorpeshansky deleted the igorpeshansky-dcgm-new-metrics branch September 13, 2024 22:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants